this, we need large amounts of data (a “training data set”) and feedback to the neural network as to whether the computer’s prediction was correct or incorrect (generated by the network itself: unsupervised learning; supplied from the outside: supervised learning), even for individual molecules or sequences (predictions, for example, of the secondary structure of a protein, of its localisation in the cell, etc.).
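The idea of supervised learning with external feedback can be sketched with a single artificial neuron (a perceptron). The toy task and update rule below are illustrative and not taken from the text: the network’s prediction is compared with the known label, and the resulting error signal (the feedback) adjusts the weights.

```python
# Minimal sketch of supervised learning: a single artificial neuron receives
# external feedback (the known labels) and adjusts its weights accordingly.
# The data set (logical AND) is a toy example, chosen because it is linearly
# separable, so the perceptron is guaranteed to converge.

def train_perceptron(samples, labels, lr=0.1, epochs=20):
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), y in zip(samples, labels):
            pred = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            err = y - pred            # feedback: was the prediction correct?
            w[0] += lr * err * x1     # adjust weights in proportion to the error
            w[1] += lr * err * x2
            b += lr * err
    return w, b

X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 0, 0, 1]                      # labels for logical AND
w, b = train_perceptron(X, y)
preds = [1 if w[0] * x1 + w[1] * x2 + b > 0 else 0 for x1, x2 in X]
```

After training, `preds` reproduces the labels exactly; real networks apply the same feedback principle with many layers and far larger training sets.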
Genetic algorithms are a sophisticated search strategy that I myself have used enthusiastically for many years. Here, solutions are bred in the computer with the help of artificial evolution, through selection, mutation and recombination of digitally programmed chromosomes. These chromosomes encode the problem you want to solve. This works surprisingly well, given sufficiently large populations of individuals and several hundred generations of evolution. For example, with appropriate selection parameters one can obtain protein structures from the sequence with a small error relative to the observed structure (Dandekar and Argos 1994, 1996, 1997). The “catch” with this approach is how to encode the protein structure efficiently enough in the chromosomes (e.g., by “internal coordinates”) and how to design the selection “correctly” (many years of work, which then also requires a sufficient number of known, experimentally resolved crystal structures). Another clever search strategy
for complex problems with a huge, often high-dimensional search space is to imitate ants (ant colony optimization). Here, an anthill is programmed electronically, and the individual virtual ants scour the solution space, leaving behind a scent trail. In the computer, this trail is amplified into a virtual ant trail wherever particularly good solutions lie along the searched route. This method is also surprisingly powerful for complex problems, but it likewise requires a lot of patience until the real-world problem has been mapped into this virtual “forest of ants” well enough for the solutions to be tractable. A breakthrough in predicting the 3D structures of proteins was recently achieved by Senior et al. (2020) and Tunyasuvunakool et al. (2021).
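The genetic-algorithm loop described above (selection, recombination, mutation of digital chromosomes) can be sketched in a few lines. The fitness function below is a deliberately trivial stand-in (counting 1-bits, the classic “OneMax” toy problem), not the protein-structure scoring from the text; a real application would decode each chromosome into a candidate structure and score it against experimental data.

```python
import random

def evolve(fitness, length=12, pop_size=30, generations=100,
           mut_rate=0.08, seed=1):
    rng = random.Random(seed)
    # Each "chromosome" is a bit list encoding a candidate solution.
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]

    def tournament():
        # Selection: the fitter of two randomly drawn individuals survives.
        a, b = rng.sample(pop, 2)
        return a if fitness(a) >= fitness(b) else b

    history = []  # best fitness per generation (never decreases, thanks to elitism)
    for _ in range(generations):
        elite = max(pop, key=fitness)
        history.append(fitness(elite))
        nxt = [elite[:]]                    # elitism: keep the current best
        while len(nxt) < pop_size:
            p1, p2 = tournament(), tournament()
            cut = rng.randrange(1, length)  # recombination: one-point crossover
            child = p1[:cut] + p2[cut:]
            # Mutation: flip each bit with small probability
            child = [bit ^ (rng.random() < mut_rate) for bit in child]
            nxt.append(child)
        pop = nxt
    return max(pop, key=fitness), history

# Toy fitness: the number of 1-bits in the chromosome ("OneMax").
best, history = evolve(sum)
```

Because the best individual is always carried over unchanged (elitism), the best fitness per generation is monotonically non-decreasing, which is one simple way to keep such a search well-behaved.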
14.3 Current Applications of Artificial Intelligence in Bioinformatics
The high-dimensional data in biology and medicine contain various variables (features), e.g. diagnosis, expression values, age, weight. In addition, there are complex relationships and correlations, but also confounders (confounding variables), batch effects and multicollinearity between the variables. In short, it is very time-consuming to find out which variables are relevant and which are not. An application from artificial intelligence research that has long been used in bioinformatics is machine learning (Tarca et al. 2007; Sommer and Gerlich 2013), both to structure the data and extract relevant features and to develop classification models (predictive models). We have already encountered PCA (Chap. 7) as a way to decompose high-dimensional data into principal components and reduce their complexity (dimensionality reduction). Other methods are cluster and regression analyses. While cluster analysis is used to classify data into groups (clusters) with similar characteristics, regression analysis is used to find correlations and relationships between variables.
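As a minimal illustration of regression analysis, the closed-form least-squares fit for a single explanatory variable can be written out directly. The toy data below (an expression value depending linearly on a dose) are invented for the example, not taken from the text.

```python
# Minimal sketch of regression analysis: ordinary least squares for one
# explanatory variable, fitted in closed form.

def fit_line(xs, ys):
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return slope, intercept

# Hypothetical example: expression value (y) as a linear function of dose (x)
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]          # exactly y = 2x + 1
slope, intercept = fit_line(xs, ys)
```

The fit recovers slope 2 and intercept 1; with noisy real data the same formula yields the best-fitting line in the least-squares sense, and multivariate versions extend the idea to many variables at once.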